The goal of this project is to become familiar with plotting maps in R. The result itself (the routes I plan for future hikes) is novel and are not the main goal of the project. This is my first exposure to spatial work and all the code I produce is a result of self teaching.
The non-code related goal of this project is to produce a map of the lake district, and all corresponding wainwrights in the lake district. These wainwrights will then be coloured according to whether I have summitted them or not. To finish, I will cluster the remaining wainwrights in order to create a list of hikes I can do to summit every wainwright. This will include K-means clustering and the travelling salesperson problem.
Before we go any further, it is worth giving a brief explanation to those who dont know what a Wainwright is. Wainwrights are typically a small statue/pile of stones at the top of a peak which signify the finishing point of a peak. The name was coined by Alfred Wainwright who wrote about these 214 peaks in his book ‘Pictorial Guide to the Lakeland Fells’. Below is a picture of me summitting the 2nd highest peak in the lake district, Helvellyn.
A picutre of myself at the summit of Helvellyn, 2021
To start this project, I have sourced a CSV file of all wainwrights in the lake district. The structure of this file can be seen below, it is clear that some data wrangling is in order to get the data in the format I want. Each row is a string which includes an OS grid reference, peak name and height in Meters.
head(ListOfWainrights, 3)
For the time being, I am only interested in the OS grid reference and the peak name. The first part of the string is removed, we then split the information into 3 different columns, the height is kept in the data in case I want to re-use this table for a different purpose.
df_new <- separate(ListOfWainrights, Wainrights, into = c("Drop", "Keep"), sep = "^\\S*\\K\\s+")
df_new <- df_new[,2]
df_new <- separate(df_new, Keep, into = c("OSGRidRef", "Wainright"), sep = "^\\S*\\K\\s+")
WainrightsInfo <- separate(df_new, Wainright, into = c('Name', 'Height'), sep = '\\(.*')
library(stringr)
ElevationMeters <- as.numeric(sub("\\D*(\\d{3}).*", "\\1", df_new$Wainright))
WainrightsInfo$Height <- ElevationMeters
The OS grid reference is tricky to plot in R, instead, it is simpler to convert it to the WGS84 system. With this system, we get a latitude and longitude which can be plotted onto a map.
Coordinates <- osg_parse(WainrightsInfo$OSGRidRef, coord_system = "WGS84")
WainrightsInfo$lon <- Coordinates[[1]]
WainrightsInfo$lat <- Coordinates[[2]]
There are some basic checks which can be done to make sure the lat/lon values have translated properly. The package ‘maps’ can be segmented to plot just the UK, I then overlay the wainwrights latitude and longitude to test that they are plotting correctly, they should all be localised over the lake district, which they are.
library(maps)
library(ggplot2)
worldmap = map_data('world')
ggplot() +
geom_polygon(data = worldmap,
aes(x = long, y = lat, group = group),
fill = 'gray90', color = 'black') +
coord_fixed(ratio = 1.3, xlim = c(-10,3), ylim = c(50, 59)) +
theme_void() +
geom_point(data = WainrightsInfo,
aes(x = as.numeric(lon),
y = as.numeric(lat)), color = 'black', alpha = .7) +
scale_size_area(max_size = 8) +
scale_color_viridis_c() +
theme(legend.position = 'none') +
theme(title = element_text(size = 12))
To get a more in detail map, I discovered the package ggmap, this is part of the extremely useful GGplot family and has been fantastic for this project. Snippets of maps are taken from google earth, the boundaries for this map are determined by latitude and longitude which are provided manually. To select an appropriate box, I simply used the minimum and maximum lat/lon values from the Wainwright data and added a manual boundary with trial and error.
sq_map <- get_stamenmap( LakeDistrictBox, maptype = "terrain")
## Source : http://tile.stamen.com/terrain/10/501/325.png
## Source : http://tile.stamen.com/terrain/10/502/325.png
## Source : http://tile.stamen.com/terrain/10/503/325.png
## Source : http://tile.stamen.com/terrain/10/504/325.png
## Source : http://tile.stamen.com/terrain/10/501/326.png
## Source : http://tile.stamen.com/terrain/10/502/326.png
## Source : http://tile.stamen.com/terrain/10/503/326.png
## Source : http://tile.stamen.com/terrain/10/504/326.png
## Source : http://tile.stamen.com/terrain/10/501/327.png
## Source : http://tile.stamen.com/terrain/10/502/327.png
## Source : http://tile.stamen.com/terrain/10/503/327.png
## Source : http://tile.stamen.com/terrain/10/504/327.png
og_map <- ggmap(sq_map) + geom_point(data = WainrightsInfo, color = "red", size = 1)
plot(og_map)
In the interest of this project, I dont want to summit peaks which I have already been to. Luckily, I keep a note of all Wainwrights I have been to. This allowed me to colour the Wainwrights based on whether I have been there before.
WainrightsInfo$Name <- trimws(WainrightsInfo$Name, which = c("both"))
WainrightsInfo$Summitted <- 'No'
for (name in 1:length(WainrightsInfo$Name)){
if (WainrightsInfo$Name[name] %in% c('Sergeant Man', 'Helvellyn', 'Mardale Ill Bell', 'High Street', 'The Knott (High Street)', 'Rampsgill Head', 'Kidsty Pike', 'Holme Fell', 'Tarn Crag', 'Loughrigg Fell', 'Hallin Fell', 'High Raise')){
WainrightsInfo$Summitted[name] <- 'Yes'
}
}
WainrightsInfo$Summitted <- as.factor(WainrightsInfo$Summitted)
sq_map <- get_map(location = LakeDistrictBox, maptype = "satellite", source = "google")
ggmap(sq_map) + geom_point(data = WainrightsInfo, aes(color = Summitted), size = 1) + scale_colour_manual(values = c('red', 'green'))
The plot above makes me realise how far I have to go! One satisfying moment is creating a table which removes all peaks that I have been to.
NonSummittedPeaks <- WainrightsInfo[which(WainrightsInfo$Summitted == 'No'),]
head(NonSummittedPeaks, 3)
Now that all the remaining Wainwrights are isolated, they can be clustered into appropriately sized groups. At this point, the legitimacy of the solution is damaged a little bit. What I mean by this is the following, the distance between wainwrights are measured using the euclidean distance, this looks at the most efficient route between two wainwrights as the crow flies, it doesnt take into account elevation or viable walking paths. This means it might cluster wainwrights together as they are near to eachother on a map, but in reality, they may be hard to travel between. I continue this exercise as I want a final map which could be used, in theory.
The optimal number of clusters which minimises the distance between points is summarised below. As the number of clusters is increased, the error goes down which makes sense logically.
library(ClusterR)
## Loading required package: gtools
library(cluster)
##
## Attaching package: 'cluster'
## The following object is masked from 'package:maps':
##
## votes.repub
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
WainRightCoords <- NonSummittedPeaks[,c('lon', 'lat')]
set.seed(314) # Setting seed
elbowPlot <- fviz_nbclust(WainRightCoords, kmeans, method = "wss", k.max = 25)
plot(elbowPlot)
There is no correct answer here, we are looking for a point in the plot where increasing the number of clusters by 1 dramatically reduces the WSS. This happens at a low number of clusters, which is not appropriate for our problem as choosing 4 clusters would mean each hike containing roughly 50 Wainwrights to summit.
25 Clusters are chosen , meaning on average, each cluster/potential hike will have 8 peaks in it. These clusters are then plotted by colour. The top left cluster is an example of a hike which isnt suitable.
finalClustering <- kmeans(WainRightCoords, 25, nstart = 25)
NonSummittedPeaks$Cluster <- as.factor(finalClustering$cluster)
ggmap(sq_map) + geom_point(data = NonSummittedPeaks, aes(color = Cluster), size = 3)
#+ scale_colour_manual(values = c('red', 'green'))
From the information gathered, it is possible to create the most efficient ‘top down’ route which traverses all Wainwrights once. This ignores the clusters for now and the result can be seen below.
library(TSP)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(leaflet)
dist_mat <-
dist(
NonSummittedPeaks %>% dplyr::select(lon, lat),
method = 'euclidean'
)
tsp_prob <- TSP(dist_mat)
tsp_prob <- insert_dummy(tsp_prob, label = 'dummy')
tour <-
solve_TSP(
tsp_prob,
method = 'two_opt',
control = list(rep = 16)
)
path <- names(cut_tour(tour, 'dummy'))
NonSummittedPeaks <-NonSummittedPeaks %>%
mutate(id_order = order(as.integer(path)))
NonSummittedPeaks %>%
arrange(id_order) %>%
leaflet() %>%
addTiles() %>%
addCircleMarkers(
~lon,
~lat,
fillColor = 'red',
fillOpacity = 0.5,
stroke = FALSE
) %>%
addPolylines(~lon, ~lat)
This plot shows one (Extremely!) long hike. This route can be split into the clusters created earlier, but we then lose the efficiency created by the TSP. Nevertheless, this is a good map to produce, and the inital mapping can be seen below.
h <- ggplot(data=NonSummittedPeaks, aes(lon,lat, color = Cluster))
h <- h + geom_point()
h +
geom_path(data=NonSummittedPeaks,aes(color = Cluster))
In the above plot the TSP output is split by the clusters created
before. This is clearly not the most efficient route which can be taken,
there is alot of crossover between the hikes and the routes are not
efficient.
FinalMap <- ggmap(sq_map) + geom_point(data=NonSummittedPeaks, aes(lon,lat, color = Cluster)) +
geom_path(data=NonSummittedPeaks,aes(color = Cluster, group = Cluster), color = 'black')
FinalMap + theme(legend.position="none") + ggtitle('Lake District Wainright clustering')
To remedy this problem the data are split into separate tables for
every cluster. The travelling salesperson problem is then applied to
each cluster individually, this creates the most efficient route for
each cluster. The chunk of code below shows the creation of the
function, DistanceCalc, being created. This function is
then applied to the first cluster. The most efficient route for cluster
1 is plotted. This can be compared to the previous plot, the route looks
more efficient.
DistanceCalc <- function(cluster){
dist_mat <-
dist(
cluster %>% dplyr::select(lon, lat),
method = 'euclidean'
)
tsp_prob <- TSP(dist_mat)
tsp_prob <- insert_dummy(tsp_prob, label = 'dummy')
tour <-
solve_TSP(
tsp_prob,
method = 'farthest_insertion',
control = list(rep = 16)
)
path <- names(cut_tour(tour, 'dummy'))
cluster <-cluster %>%
mutate(id_order = order(as.integer(path)))}
cluster1 <- NonSummittedPeaks[which(NonSummittedPeaks$Cluster == 1),]
cluster1 <- DistanceCalc(cluster1)
cluster1<-cluster1 %>% arrange(id_order)
FinalMap <- ggmap(sq_map) + geom_point(data=NonSummittedPeaks, aes(lon,lat, color = Cluster)) +
geom_path(data=cluster1,aes(color = Cluster, group = Cluster), color = 'black')
FinalMap + theme(legend.position="none") + ggtitle('Lake District Wainright clustering')
The newly created function is applied to every cluster (this chunk of code is hidden as it is repetitive). The final output can be seen below. To summarise, this plot has taken all lake district Wainwrights which I have not summitted, they are then clustered using K means clustering and then for each cluster the most efficient route is calculated to traverse all of them. The main issue that occurs with this output is the lack of realism, the routes selected do not match actual routes on the ground, the height of these peaks is also not taken into account.
FinalMap <- ggmap(sq_map) + geom_point(data=NonSummittedPeaks, aes(lon,lat, color = Cluster)) +
geom_path(data=cluster1,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster2,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster3,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster4,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster5,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster6,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster7,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster8,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster9,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster10,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster11,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster12,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster13,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster14,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster15,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster16,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster17,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster18,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster19,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster20,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster21,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster22,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster23,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster24,aes(color = Cluster, group =1), color = 'black') +
geom_path(data=cluster25,aes(color = Cluster, group =1), color = 'black')
FinalMap + theme(legend.position="none") + ggtitle('Lake District Wainright clustering (25 Hikes)')
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?